This project examines the underlying associations between specific age groups and different crimes within the neighborhoods of Toronto. Two datasets are used in this study. The first is the “Toronto Neighbourhoods_shp” file, a shapefile that R reads as an object of class ‘sf’. It has 140 rows and 44 columns. Each row corresponds to a specific neighborhood within the city of Toronto, and each column is a variable describing that neighborhood. These variables include the total area, total population, population of different age groups, languages spoken, and the geographical coordinates of each neighborhood. The second dataset (“neighborhood-crime-rates - 4326-crime-rates - 4326.shp”) was downloaded from the City of Toronto website. It is also a shapefile of class ‘sf’ and has 158 rows and 185 columns. The rows represent specific neighborhoods within the city, and the variables give the number of crimes and the crime rates in each neighborhood for the years 2014 to 2023.

We will look at the proportion of four age groups in each neighborhood: people aged 0-19 are “Children and Teens”, people aged 20-39 are “Young Adults”, people aged 40-59 are “Middle-Aged Adults”, and people aged 60 and above are “Seniors”. According to Statistics Canada, the 2021 Census found that downtown Toronto is a hotspot for millennials. It also reported that people between the ages of 15 and 64 account for 81.2% of downtown Toronto’s population, which implies that relatively few seniors (i.e. people aged 60 and above) live in the downtown core. The census further reported a smaller proportion of children in the downtown core and pockets of concentration of seniors across Toronto, in neighborhoods such as Rosedale-Moore Park and northern Scarborough areas.
Additionally, we will use the crime dataset to fit spatial models that examine the association between each age group and neighborhood crime rates. The response (dependent) variables we will focus on are the 2023 assault rate, the 2023 homicide rate, and the 2023 auto theft rate. Finally, we will construct more complex spatial models using additional predictors from the “Toronto Neighbourhoods_shp” file, such as ‘Population of Males’ and ‘Population of Females’.
Can we discover significant spatial trends in the distribution of different age groups across Toronto’s neighborhoods and ultimately confirm the findings from the 2021 census? Additionally, can we identify any linear relationships between the different age groups and crime rates?
Given the findings of the 2021 Census, we expect the following results in our analysis:

- significant spatial autocorrelation (a cluster of high values) for the Young Adults age group within the downtown neighborhoods, since this is a ‘hotspot’ for millennials.
- significant spatial autocorrelation (a cluster of low values) for the Seniors age group within the downtown neighborhoods, since 81.2% of the downtown population consists of people between the ages of 15 and 64.
- significant spatial autocorrelation for the Seniors age group in the northern Scarborough neighborhoods.

Note that we also expect to see spatial trends that were not highlighted in the 2021 Census, since census summaries and visualizations alone are not the most accurate methods for discovering spatial trends.
To begin with, we will merge the two datasets using ‘st_intersects’ and then create four subpopulations according to age: Children and Teens (ages 0-19), Young Adults (ages 20-39), Middle-Aged Adults (ages 40-59), and Seniors (ages 60 and above). Each subpopulation will be calculated as a proportion of the total population, so the variables added to our dataset give the proportion of each subpopulation in each neighborhood. Note that the merged dataset will have exactly 140 rows (i.e. 140 neighborhoods), since the ‘Toronto Neighborhoods’ shapefile has fewer rows than the ‘Neighborhood crime rates’ file.
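The proportion calculation described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study. The population counts and the count field names (`age_0_19` and so on) are hypothetical, and the sketch assumes that the four age groups together make up the total population; only the output labels follow the age groups defined in this report.

```python
# Hypothetical population counts per neighborhood (illustrative values only,
# not taken from the Toronto data).
neighborhoods = [
    {"name": "A", "age_0_19": 200, "age_20_39": 350, "age_40_59": 280, "age_60_plus": 170},
    {"name": "B", "age_0_19": 300, "age_20_39": 250, "age_40_59": 300, "age_60_plus": 150},
]

# Map each (hypothetical) count field to the age-group label used in this study.
AGE_GROUPS = {
    "age_0_19": "Children_Teens",
    "age_20_39": "Young_Adults",
    "age_40_59": "Middle_aged",
    "age_60_plus": "Seniors",
}


def age_proportions(row):
    """Return each age group's share of the neighborhood's total population."""
    # Assumption: the four groups partition the total population.
    total = sum(row[field] for field in AGE_GROUPS)
    return {label: row[field] / total for field, label in AGE_GROUPS.items()}


proportions = {row["name"]: age_proportions(row) for row in neighborhoods}
for name, props in proportions.items():
    print(name, props)
```

Because each proportion is divided by the same neighborhood total, the four values for a neighborhood sum to one, which is a quick sanity check on the merged data.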
Since we are working with areal data, we need to represent the proximity between areal units (i.e. neighborhoods) with an adjacency matrix (proximity matrix). We will use both border-based and distance-based methods to accomplish this; using more than one method lets us check that our results are consistent. For the border-based method, we will use “Queen” connectivity, and for the distance-based method we will use K-Nearest Neighbors (kNN) with k = 4, 5, and 6. We chose these values of k so that each neighborhood has between four and six neighbors, a range we believe is well suited to detecting spatial trends. From here, we will construct a weight matrix for each method, with row-standardized weights. For all four neighbor definitions (Queen and kNN with k = 4, 5, and 6), we will conduct a Moran’s I test for each age group, which gives 16 tests in total, four for each group. Moran’s I is defined as:
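\[ I = \frac{1}{s^2} \frac{\sum_{i}\sum_{j}w_{ij}(y_{i}-\overline{y})(y_{j}-\overline{y})}{\sum_{i}\sum_{j}w_{ij}} \]
where \(y_{i}\) is the observed value in neighborhood \(i\), \(\overline{y}\) is the mean over all \(n\) neighborhoods, \(w_{ij}\) is the spatial weight between neighborhoods \(i\) and \(j\), and \(s^2 = \frac{1}{n}\sum_{i}(y_{i}-\overline{y})^2\).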
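To make the statistic concrete, the following Python sketch builds a kNN adjacency matrix from point coordinates, row-standardizes it, and evaluates the formula above. It is an illustration only, not the R workflow used in the study: the centroids and values are made up, the function names are hypothetical, and \(s^2\) is taken to be the variance with divisor \(n\).

```python
import math


def knn_adjacency(coords, k):
    """Binary adjacency: adj[i][j] = 1 if j is among the k nearest points to i."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        # Sort all other points by Euclidean distance to point i.
        by_distance = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        for _, j in by_distance[:k]:
            adj[i][j] = 1
    return adj


def row_standardize(adj):
    """Divide each row by its row sum so every unit's weights add up to 1."""
    weights = []
    for row in adj:
        total = sum(row)
        weights.append([v / total if total else 0.0 for v in row])
    return weights


def morans_i(y, w):
    """Global Moran's I, with s^2 computed using divisor n (an assumption)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    s2 = sum(d * d for d in dev) / n
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    s0 = sum(sum(row) for row in w)
    return cross / (s2 * s0)


# Made-up centroids on a line and made-up proportions that increase from
# left to right, so neighboring units have similar values.
centroids = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
values = [0.10, 0.12, 0.15, 0.40, 0.45, 0.50]

weights = row_standardize(knn_adjacency(centroids, k=2))
moran = morans_i(values, weights)
print(round(moran, 3))
```

Note that a kNN adjacency matrix is generally not symmetric: unit \(j\) can be one of the nearest neighbors of unit \(i\) without the reverse being true, whereas Queen contiguity is symmetric by construction.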
We conclude that there is significant spatial autocorrelation when \(p < 0.05\); in that case \(I > 0\) indicates positive autocorrelation (similar values cluster together) and \(I < 0\) indicates negative autocorrelation. After comparing the Moran’s I test results across the four neighbor definitions, we will choose the most suitable one (i.e. the one with the lowest p-value). Furthermore, we will use the Monte Carlo approach and correlograms of the Moran’s I statistic to help determine which weight matrix is best to use. Then, using our chosen matrix, we will quantify local spatial autocorrelation between Toronto neighborhoods using the Local Moran’s I and the Getis-Ord G* test. Both of these tests are Local Indicators of Spatial Association (LISA), which means that they indicate the extent of spatial clustering around an individual areal unit rather than across the whole region. Similar to the global Moran’s I, the Local Moran’s I shows significant local spatial autocorrelation when \(p < 0.05\) with either \(I > 0\) or \(I < 0\), and the Getis-Ord G* produces a z-score denoted \(G_{i}^{*}\). A group of areal units with high \(G_{i}^{*}\) (\(p < 0.05\)) indicates a ‘hotspot’, whereas a group with low \(G_{i}^{*}\) (\(p < 0.05\)) indicates a ‘coldspot’.
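The Monte Carlo approach and the Local Moran’s I can be sketched in a few lines of Python. This is a hedged illustration on made-up data, not the study's R code: the function names are hypothetical, the p-value is a one-sided permutation p-value for positive autocorrelation, and the local statistic uses the common form \(I_i = \frac{(y_i-\overline{y})}{s^2}\sum_j w_{ij}(y_j-\overline{y})\) with \(s^2\) computed using divisor \(n\).

```python
import random


def morans_i(y, w):
    """Global Moran's I, with s^2 computed using divisor n (an assumption)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    s2 = sum(d * d for d in dev) / n
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    s0 = sum(sum(row) for row in w)
    return cross / (s2 * s0)


def local_morans_i(y, w):
    """Local Moran's I for every unit (one value per areal unit)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    s2 = sum(d * d for d in dev) / n
    return [dev[i] / s2 * sum(w[i][j] * dev[j] for j in range(n)) for i in range(n)]


def monte_carlo_p_value(y, w, n_sim=999, seed=42):
    """Permutation test: shuffle values over units and recompute Moran's I."""
    rng = random.Random(seed)
    observed = morans_i(y, w)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            at_least_as_large += 1
    # The observed arrangement counts as one of the simulations.
    return observed, (at_least_as_large + 1) / (n_sim + 1)


# Eight made-up units on a line; each unit neighbors the units beside it.
n_units = 8
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(n_units)] for i in range(n_units)]
weights = [[v / sum(row) for v in row] for row in adjacency]  # row-standardized

# Low values on the left, high values on the right: a clustered pattern.
values = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

observed_i, p_value = monte_carlo_p_value(values, weights)
local_i = local_morans_i(values, weights)
print(observed_i, p_value)
print(local_i)
```

With row-standardized weights and no isolated units, the local values sum to \(n\) times the global statistic, which links the local and global views of the same pattern.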
For the next part of our study, we aim to predict the rates of different crimes in Toronto by using the four age groups as linear predictors. For each of the crime rates (assault, auto theft, and homicide), we will first scale the response variable, then fit a non-spatial linear regression model using the four age groups as predictors, and then use backwards selection to find the best model. Then, we will conduct a Moran’s I test on the residuals to see whether there is any spatial dependence between the error terms.

Next, we need to account for the spatial dependence between neighborhoods by using autoregressive terms in our linear model, which motivates simultaneous autoregressive (SAR) and conditional autoregressive (CAR) models. There are three types of SAR models: SAR error (spatial dependence in the error terms only), SAR lag (spatial dependence in the lagged response only), and SAR mixed (spatial dependence in both lag and error). A CAR model, in contrast, defines spatial dependence by specifying a Gaussian conditional distribution for each observation given its neighbors. Finally, the last model type we will construct is a linear mixed-effects model (LMM) with a random intercept term for each neighborhood. This model will account for spatial dependence by specifying a Matérn correlation structure in the error terms.

Using the ‘optimal’ predictors from the non-spatial linear regression model, we will construct the five models described above and compare them with the likelihood ratio test. We will also compute the Moran’s I test statistic for the residuals of each model and examine the spatial dependence in the residuals. In the final part of our study, we will add two new predictors to our model: ‘Pop_Males’ and ‘Pop_Female’. We will first scale both of these variables before including them and then repeat the steps outlined above. The final models will be compared using the likelihood ratio test.
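The first steps of this workflow (scale the response, fit a non-spatial regression, then check the residuals for spatial dependence) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the study's R code: it uses a single predictor instead of four, the data values and function names are hypothetical, and it only computes the Moran’s I statistic of the residuals rather than the full significance test.

```python
import statistics


def scale(values):
    """Center to mean 0 and divide by the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_mean = statistics.mean(x)
    y_mean = statistics.mean(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def morans_i(y, w):
    """Global Moran's I, with s^2 computed using divisor n (an assumption)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    s2 = sum(d * d for d in dev) / n
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    s0 = sum(sum(row) for row in w)
    return cross / (s2 * s0)


# Hypothetical data for six neighborhoods arranged on a line.
young_adults = [0.25, 0.27, 0.30, 0.35, 0.45, 0.60]       # predictor (proportion)
assault_rate = [410.0, 455.0, 480.0, 600.0, 790.0, 1020.0]  # response (made up)

# Row-standardized weights: each unit neighbors the units directly beside it.
n_units = len(young_adults)
adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(n_units)] for i in range(n_units)]
weights = [[v / sum(row) for v in row] for row in adjacency]

scaled_rate = scale(assault_rate)
intercept, slope = fit_simple_ols(young_adults, scaled_rate)
residuals = [y - (intercept + slope * x) for x, y in zip(young_adults, scaled_rate)]
residual_moran = morans_i(residuals, weights)
print(slope, residual_moran)
```

A residual Moran’s I far from zero would suggest that the non-spatial model leaves spatial structure unexplained, which is the situation the SAR, CAR, and LMM models are meant to address.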
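Table 1: Summary statistics for the proportion of each age group across Toronto neighborhoods

| Statistic | Children_Teens | Young_Adults | Middle_aged | Seniors |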
|---|---|---|---|---|
| Min. | 0.0684424 | 0.1738382 | 0.1904885 | 0.0788095 |
| 1st Qu. | 0.1882813 | 0.2515311 | 0.2783513 | 0.1695733 |
| Median | 0.2125895 | 0.2798399 | 0.2959825 | 0.1972979 |
| Mean | 0.2107367 | 0.2943919 | 0.2934439 | 0.2013889 |
| 3rd Qu. | 0.2432896 | 0.3144555 | 0.3120772 | 0.2279440 |
| Max. | 0.3217165 | 0.6107143 | 0.3470546 | 0.3246692 |
Figure 1: Comparing the population distribution of age groups across 140 Toronto neighbourhoods
Figure 2: Histograms for the proportions of each age group
Table 1 shows that ‘Young Adults’ and ‘Middle-Aged Adults’ have the two highest average proportions across Toronto neighborhoods (0.294 and 0.293 respectively) among the four categories. ‘Seniors’ have the lowest average proportion at 0.201, while ‘Children and Teens’ have an average of 0.211. The boxplots in Figure 1 give a visual representation of the summary statistics in Table 1, where the differences in the median (black line) between the groups are easily discernible. We also see several outliers in the ‘Young Adults’ and ‘Middle-Aged Adults’ boxplots, which indicates that there are several neighborhoods with an unusually high proportion of Young Adults and several other neighborhoods with an uncharacteristically low proportion of Middle-Aged Adults. Figure 2 shows the distribution of the proportion of individuals in each neighborhood for each age group. The ‘Children and Teens’ histogram is wide and unimodal, with a peak in the 0.2-0.25 interval. The ‘Young Adults’ histogram is narrow, unimodal, and right skewed, with a peak in the 0.25-0.3 interval. The ‘Middle-Aged Adults’ histogram is wide and left skewed, with a single peak between 0.25 and 0.35. Finally, the ‘Seniors’ histogram is relatively wide and almost resembles a uniform distribution between 0.15 and 0.25; its largest peak is between 0.175 and 0.2.
Maps 1-4 give insight into potential neighborhood clusters and patterns for each age group. Map 1 shows that the majority of downtown neighborhoods fall within the 0-25th percentile for the proportion of Children and Teens. Map 2 shows that the downtown neighborhoods have a higher proportion of ‘Young Adults’, while Map 3 shows an even distribution of ‘Middle-Aged Adults’ across Toronto. Finally, Map 4 indicates that the downtown core neighborhoods fall within the 0-25th percentile for the Senior population.
Table 2: Moran’s I test results for Children and Teens

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.5904388 | 0 |
| KNN N=4 | 0.5887324 | 0 |
| KNN N=5 | 0.5401040 | 0 |
| KNN N=6 | 0.5254209 | 0 |
Figure 3: Correlograms for the Moran’s I test: Children/Teens
Table 3: Moran’s I test results for Young Adults

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.6112999 | 0 |
| KNN N=4 | 0.6054979 | 0 |
| KNN N=5 | 0.5724384 | 0 |
| KNN N=6 | 0.5465047 | 0 |
Figure 4: Correlograms for the Moran’s I test: Young Adults
Table 4: Moran’s I test results for Middle-Aged Adults

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.4928532 | 0 |
| KNN N=4 | 0.4679394 | 0 |
| KNN N=5 | 0.4380185 | 0 |
| KNN N=6 | 0.4151052 | 0 |
Figure 5: Correlograms for the Moran’s I test: Middle-Aged Adults
Table 5: Moran’s I test results for Seniors

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.2082105 | 1.04e-05 |
| KNN N=4 | 0.2657183 | 1.00e-06 |
| KNN N=5 | 0.2534780 | 2.00e-07 |
| KNN N=6 | 0.2368294 | 1.00e-07 |
Figure 6: Correlograms for the Moran’s I test: Seniors
Figure 7: Monte Carlo simulations for the Moran’s I tests in each age group with Queen connectivity
In all of the Moran’s I tests, \(p < 0.05\) and the Moran’s I test statistic was greater than 0, indicating significant positive spatial autocorrelation for every age group and neighbor definition. For the ‘Children and Teens’, ‘Young Adults’, and ‘Middle-Aged Adults’ age groups, Queen connectivity gave the highest test statistic, and the p-values were reported as 0 for all four methods. In contrast, the ‘Seniors’ Moran’s I test had the highest test statistic for the KNN N=4 method, while the KNN N=6 method gave the lowest p-value (Table 5). The Monte Carlo simulations in Figure 7 show the same test statistic (red line) as the ‘Queen’ Moran’s I tests in Tables 2, 3, 4, and 5. Finally, the correlograms in Figures 3, 4, 5, and 6 show the fewest lags (approximately 2-3) for the Queen border-based method in all four age groups. Consequently, we used the ‘Queen’ neighborhood definition for the rest of our analysis because it had the highest Moran’s I test statistic in three of the four age groups and was thus the most suitable for identifying significant spatial clusters.